Data and Visualization
Increasingly, as we participate in social movement activity we leave data traces across the web: tweets, Facebook updates and likes, IRC conversations, and other activities across the net produce information that can later be gathered, analyzed, mined, and visualized. Web companies do this constantly; for most, gathering, analyzing, packaging, and selling user data is a main source of revenue. Intelligence agencies are also investing increasing resources in automated extraction of information from the social web. These developments have serious implications for privacy. At the same time, the tools to gather, analyze, and visualize large datasets are available to more people than ever before, including researchers, small organizations, and everyday individuals. This page is for sharing datasets, as well as for sharing information about how Occupy Researchers might collaborate to gather, share, analyze, and visualize data about the movement.
Initial motivating questions include:
- Rather than have 50 people scraping Twitter and other sites for the same dataset, can we pool this work?
- Can people who are assembling a DB share links to the tools they're using, their visualizations, and the data itself?
Data Sets
Codebooks
Events
Maybe we can set up a chat soon to talk about all this. Write here if you are interested:
- 03.23-24.2012 Occupy Data Hackathon 2. http://bit.ly/occupyhackathon
- 12.09.2011 R-Shief Occupy Data hackathon (December 9th-11th): Any updated info about this? Document to prepare the event: http://bit.ly/occupyhackathon
- 11.19.2011 Break out from 4th conference call about Data Mining. Nov. 19th, 1pm Eastern / 10am Pacific: http://bit.ly/orcall4 (see notes at the end of document)
- 11.5.2011 Break out from 2nd conference call about Data Mining. Nov 5th 2011: http://bit.ly/ORcall2 (see notes at the end of document)
Works-in-Progress
Lab notebook style record of work that's underway but not yet ready for prime time.
Discussion - Mailing list
Stay up to date with this group's discussion by signing up for the mailing list:
http://groups.google.com/group/occupyresearch-data
I've just found another #occupydata list:
http://groups.google.com/group/occupydata
Wishlist
The link below points to a wishlist for researchers to express analyses, transformations, and reports they'd like to be able to make with Twitter data. It's not strictly for Occupy-related research but seems like a useful tool for us, too. Contributing to this page is as simple as writing a sentence starting with "I wish..."
Visualizations
Prior art
Distributed data collection
Shared data
Occupydata.org (coming soon)
Data on the DataHub: thedatahub.org/group/occupy
MapBox: tiles.mapbox.com/occupy
Github: https://github.com/occupydc
Coordinated Twitter searching
DIY data mining on Twitter:
- Rate limits and unpredictable fluctuations in keyword popularity make it impossible for one person to get a comprehensive data set without having access to the ($$$$$$$$) "firehose"
- However, we should be able to coordinate our searches using a shared database / keyword server so that the burden is distributed
- It's not meant to circumvent the rate limit but to maximize the utility of each query we make to the API so that we are not collecting redundant data
- Useful parse tools in Scraperwiki: https://scraperwiki.com/search/occupy/
https://scraperwiki.com/scrapers/tweet_search_saving_all_meta_data/
How do we do this for the older ones?
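To make the coordination idea above concrete, here is a rough sketch of what a shared keyword server might do. Everything in it is an assumption for discussion: the class name `KeywordCoordinator`, the round-robin split, and the tweet records (plain dicts with an `id` field) are hypothetical, and it runs in memory with no network calls or Twitter API access.

```python
from collections import defaultdict


class KeywordCoordinator:
    """Hypothetical in-memory stand-in for a shared keyword server."""

    def __init__(self, keywords):
        # De-duplicate and sort so every client sees the same ordering.
        self.keywords = sorted(set(keywords))
        self.clients = []
        self.seen_ids = set()

    def register(self, client_id):
        """Add a client that is willing to run searches."""
        if client_id not in self.clients:
            self.clients.append(client_id)

    def assignments(self):
        """Split keywords round-robin so no two clients query the same one."""
        if not self.clients:
            return {}
        plan = defaultdict(list)
        for i, keyword in enumerate(self.keywords):
            plan[self.clients[i % len(self.clients)]].append(keyword)
        return dict(plan)

    def submit(self, tweets):
        """Keep only tweets whose id has not been seen; return the new ones."""
        fresh = []
        for tweet in tweets:
            if tweet["id"] not in self.seen_ids:
                self.seen_ids.add(tweet["id"])
                fresh.append(tweet)
        return fresh


coordinator = KeywordCoordinator(["occupyla", "occupysandiego", "ows", "occupyla"])
coordinator.register("client-a")
coordinator.register("client-b")
plan = coordinator.assignments()
first = coordinator.submit([{"id": 1}, {"id": 2}])
second = coordinator.submit([{"id": 2}, {"id": 3}])
```

Each client would only query the keywords in its own slice of `plan`, and `submit` drops tweets that another client already delivered. A real version would need to rebalance when a popular keyword overwhelms one client.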
Some quick+dirty experiments this week with the regular old Search API and the Streaming API demonstrate the fundamental challenge:
- Seemed to capture everything for #occupysandiego without hitting the rate limit
- Totally failed with #occupyla. Hit the rate limit every hour (thereby missing tweets)
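A back-of-the-envelope way to think about why one hashtag fits under the limit and another does not. The function name and all the numbers below are placeholders for illustration, not Twitter's actual limits and not measured volumes for either hashtag.

```python
def missed_per_hour(tweets_per_hour, requests_per_hour, tweets_per_request):
    """Estimate how many tweets per hour a single client cannot fetch.

    All three inputs are assumptions supplied by the caller.
    """
    capacity = requests_per_hour * tweets_per_request
    return max(0, tweets_per_hour - capacity)


# Placeholder numbers only: a quiet keyword versus a busy one.
quiet = missed_per_hour(tweets_per_hour=500, requests_per_hour=150, tweets_per_request=100)
busy = missed_per_hour(tweets_per_hour=20000, requests_per_hour=150, tweets_per_request=100)
```

With these made-up inputs the quiet keyword loses nothing and the busy one loses 5000 tweets an hour, which is the pattern the experiments above ran into.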
Next steps:
- Brainstorm and prototype a coordinated capture method (the code shouldn't be too tricky if the design is ok...)
- Run an experiment in collaboration with someone who has access to the Twitter "fireho$e." That way, we can assess the weaknesses of the distributed solution
Prior art
Chirper (last update June 2011):
Additional:
Kevin Driscoll - currently have access to the “firehose” ($$$) and collecting tweets on a few hundred keywords. Have a complete data set since ~Oct 12. Working on an exemption to licensing so that we can share this data with others. Merging data is especially important because it is very difficult to acquire a complete corpus of relevant tweets. Analysis is very rudimentary at this point: frequency, keyword discovery, collecting links. Duration of access to firehose is uncertain because of cost. Very interested in a distributed search method that will coordinate multiple clients, using the firehose as a benchmark to test it. Also, simple search client (non firehose) here:
https://github.com/driscoll/quickndirty
Twitter Spam on #OccupyBoston
From Takis:
"Since my research focuses on information reliability and trust in the real-time web, I put up a script connecting to the Twitter Streaming API and collected data for about 22 hours, containing one of the phrases: #occupyboston, Occupy Boston, Boston PD, #bpd. In total I received 17094 tweets by 8367 unique users. Searching their description field for the phrase “over 18”, I discovered 570 accounts who had sent 681 tweets. That is, 6.8% of users tweeting about the #occupyboston movement were spam bots of pornographic accounts. This might look like a small number, but keep in mind that I only searched for one kind of spam accounts (containing the words “over 18”). The point I want to make though is why do these accounts appear in the real-time stream at all. Twitter claims to use quality criteria in deciding what content is displayed in the public search stream. Don't the URLs of these accounts raise a red-flag in Twitter's quality algorithms since they are pointing to pornographic websites? Yes, I know that pornography is legal, but I was not searching for it, so why should I get to see pornographic accounts, especially some that are against Twitter rules (pornographic images in profile picture, copying content from other users, etc.: https://support.twitter.com/articles/18311-the-twitter-rules)? So, Twitter, please use some better algorithms in filtering content."
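For anyone who wants to repeat this kind of count on their own capture, here is a minimal sketch. It is not Takis's script: the function name `spam_summary` and the flat record layout (dicts with `user_id` and `description`) are assumptions made for the example, so real Streaming API records would need to be flattened first.

```python
def spam_summary(tweets, phrase="over 18"):
    """Count users whose profile description contains a phrase.

    `tweets` is assumed to be a list of dicts with hypothetical
    `user_id` and `description` keys (description may be None).
    """
    descriptions = {}
    for tweet in tweets:
        descriptions[tweet["user_id"]] = (tweet.get("description") or "").lower()
    flagged = {uid for uid, text in descriptions.items() if phrase in text}
    flagged_tweets = sum(1 for tweet in tweets if tweet["user_id"] in flagged)
    share = 100.0 * len(flagged) / len(descriptions) if descriptions else 0.0
    return {
        "users": len(descriptions),
        "flagged_users": len(flagged),
        "flagged_tweets": flagged_tweets,
        "flagged_user_pct": round(share, 1),
    }


# Made-up sample records, only to show the shape of the input.
sample = [
    {"user_id": 1, "description": "Over 18 only"},
    {"user_id": 1, "description": "Over 18 only"},
    {"user_id": 2, "description": None},
    {"user_id": 3, "description": "activist"},
]
summary = spam_summary(sample)
```

The percentage is computed over unique users, matching the quoted figures: 570 of 8367 users is about 6.8%.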
Twitter Firehose
Does anyone have ready access to the Twitter "firehose" that they can share?
- Starting Oct 15 and ending Oct 19, we'll have a complete corpus of tweets from the keywords below. After that, we need an alternate solution or someone else with a firehose we can piggyback on.
Shared DB
- What is the best solution for sharing a DB?
- Should we set one up on a server and wrap a simple API around it?
- Is it necessary to pursue a "cloud" platform like EC2 for reliability?
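One possible starting point for the "simple API around a DB" question, sketched with SQLite from the Python standard library. This is only an assumption about what a minimal version could look like: the table layout, the function names, and the choice of SQLite are all hypothetical, and nothing here settles the hosting or reliability questions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tweet store; the schema is a hypothetical minimum."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, keyword TEXT NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def count_tweets(conn):
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]


def add_tweets(conn, rows):
    """Insert (id, keyword, text) rows, skipping ids already stored.

    Returns the number of rows that were actually new.
    """
    before = count_tweets(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (id, keyword, text) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return count_tweets(conn) - before


def by_keyword(conn, keyword):
    """Return (id, text) pairs for one keyword, oldest id first."""
    cursor = conn.execute(
        "SELECT id, text FROM tweets WHERE keyword = ? ORDER BY id", (keyword,)
    )
    return cursor.fetchall()


store = open_store()
added_first = add_tweets(store, [(1, "occupyla", "a"), (2, "ows", "b")])
added_second = add_tweets(store, [(2, "ows", "b"), (3, "ows", "c")])
ows_rows = by_keyword(store, "ows")
```

Using the tweet id as the primary key means two contributors can upload overlapping captures without creating duplicates.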
Keywords for Twitter analysis
Keyword discovery
TODO: a simple script that identifies and suggests new keywords based on search results from the existing list
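A first stab at the TODO, as a sketch only. It assumes we already have the text of collected tweets as plain strings and suggests hashtags that show up in them but are not yet on the list. The function name `suggest_keywords` and the `min_count` threshold are made up for this example.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def suggest_keywords(tweet_texts, known_keywords, min_count=2):
    """Suggest hashtags seen in the results that are not on the list yet.

    Each hashtag is counted at most once per tweet, and matching is
    case-insensitive. Returns (hashtag, count) pairs, most frequent first.
    """
    known = {keyword.lstrip("#").lower() for keyword in known_keywords}
    counts = Counter()
    for text in tweet_texts:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        counts.update(tags - known)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Made-up tweets, only to show the input and output shapes.
texts = [
    "#ows march today #n17",
    "#n17 at #zuccotti",
    "#OWS #N17 #occupyphilly",
]
suggestions = suggest_keywords(texts, ["#ows", "zuccotti"])
```

Suggestions would still need a human check before going on the list, since frequent hashtags in these results can be spam or unrelated trends.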
Sample list
note: perhaps run queries on occupy* where * is wildcard. Rather than try to keep up with adding each city.
note: the list below was compiled for the somewhat idiosyncratic requirements of the gnip powertrack system.
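On the occupy* idea: nothing on this page confirms that the search tools we are using accept a wildcard, so the sketch below takes the other route and filters tweets we have already collected. The function name `occupy_tags` is hypothetical.

```python
import re

# Matches hashtags that start with "occupy", e.g. #occupyla or #occupy_boston.
OCCUPY_TAG = re.compile(r"#(occupy\w*)", re.IGNORECASE)


def occupy_tags(text):
    """Return the distinct occupy* hashtags in a tweet, lowercased and sorted."""
    return sorted({tag.lower() for tag in OCCUPY_TAG.findall(text)})


tags = occupy_tags("Marching with #OccupyLA and #occupysandiego, not #preoccupied #ows")
```

This catches new city tags without editing the list, but it cannot find tweets that were never captured in the first place.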
Last update: October 30, 9:06PM PT
#nypd
#occupy
#ocs
#ows
#sdpd
99xmas
americanautumn
fraudclosure
freedomco
globalrevlive
globalrevolution
holidays99pct
holidayhomemade
iamthe53
iamthe99
moveyourmoney
occupy_boston
occupy_okc
occupyalabama
occupyalbany
occupyalbuquerque
occupyallentown
occupyallstreet
occupyannarbor
occupyarkansas
occupyashland
occupyashville
occupyathens
occupyatlanta
occupyatlantaga
occupyaustin
occupybaltimore
occupybaystreet
occupybeantown
occupyberkeley
occupybgm
occupybham
occupybinghamton
occupybloomington
occupyboise
occupybos_media
occupyboston
occupybuf
occupybuffalo
occupyburlington
occupycanada
occupycarbondale
occupycharlotte
occupychattanooga
occupychi
occupychicago
occupychico
occupychristmas
occupycincinnati
occupycincy
occupycleveland
occupycolleges
occupycolorado
occupycoloradosprings
occupycolumbus
occupycouch
occupycsu
occupydallas
occupydayton
occupydc
occupydcneeds
occupydenver
occupydesmoines
occupydetroit
occupyearth
occupyeducation
occupyelkhart
occupyelpaso
occupyeugene
occupyeureka
occupyeverywhere
occupyfindlay
occupyflagstaff
occupyflorida
occupyfortcollins
occupyfortwayne
occupyfresno
occupygrandrapids
occupyhonolulu
occupyhouston
occupyidahofalls
occupyindianapolis
occupyindy
occupyinlandempire
occupyinternet
occupyiowacity
occupyithaca
occupyjacksonville
occupyjax
occupykansascity
occupyketchum
occupyknoxville
occupykst
occupykstreet
occupyla
occupylansing
occupylasvegas
occupylawrence
occupylexington
occupylondon
occupylosangeles
occupylouisville
occupylsx
occupymadison
occupymarines
occupymaui
occupymcallen
occupymemorial
occupymemorialdr
occupymemphis
occupymia
occupymiami
occupyminneapolis
occupymissoula
occupymn
occupymoscow
occupymuseums
occupynashville
occupynation
occupynb
occupyneworleans
occupynewyorkcity
occupynj
occupynola
occupynorfolk
occupynorthampton
occupyns
occupyny
occupynyc
occupyoakland
occupyocala
occupyoklahomacity
occupyorlandofl
occupyoxnard
occupyphiladelphia
occupyphilly
occupyphoenix
occupyphx
occupyportland
occupyprov
occupyprovidence
occupyraleigh
occupyredlands
occupyresearch
occupyriverside
occupyrochester
occupyrockford
occupysac
occupysacramento
occupysalem
occupysaltlakecity
occupysanantonio
occupysandiego
occupysanfran
occupysanfrancisco
occupysanjose
occupysanluisobispo
occupysantabarbara
occupysantacruz
occupysantafe
occupysarasota
occupysavannah
occupysd
occupyseaside
occupyseattle
occupysf
occupyslc
occupyslo
occupysolidarity
occupysouthbend
occupysouthgate
occupyspokane
occupyspringfield
occupystandrews
occupystl
occupystlouis
occupysydney
occupysyracuse
occupytampa
occupythehood
occupythemedia
occupythenation
occupytogether
occupytoronto
occupytrenton
occupytucson
occupyus
occupyusa
occupyvancouver
occupyventura
occupyvermont
occupyvictoria
occupywallst
occupywallstnyc
occupywallstreet
occupywashingtondc
occupywichita
occupywinnipeg
occupywinstonsalem
occupyworcester
occupywriters
occupyx
occupyxmas
oct15
october15
owslosangeles
reclaimuc
theother99
wearethe53
wearethe99
weoccupyamerica
zuccotti
#moveyourmoney
#moveourmoney
"Credit Union"
"Credit Unions"
#banktransferday
#breakthebanks
creditunion
creditunions
"big banks"
BofA
"Bank of America"
"vampire squid"
"Too Big to Fail"
"TBTF"
"local bank"
"community bank"
"debit fees"
"debit card"
"wells fargo"
"good-guy banks"
"zombie banks"
#opcashback
#operationreturnmail